Hypertext Categorization using Hyperlink Patterns and Meta Data

Authors

  • Rayid Ghani
  • Seán Slattery
  • Yiming Yang
Abstract

Hypertext poses new text classification research challenges as hyperlinks, content of linked documents, and meta data about related web sites all provide richer sources of information for hypertext classification that are not available in traditional text classification. We investigate the use of such information for representing web sites, and the effectiveness of different classifiers (Naive Bayes, Nearest Neighbor, and Foil) in exploiting those representations. We find that using words in web pages alone often yields suboptimal performance of classifiers, compared to exploiting additional sources of information beyond document content. On the other hand, we also observe that linked pages can be more harmful than helpful when the linked neighborhoods are highly "noisy" and that links have to be used in a careful manner. More importantly, our investigation suggests that meta data which is often available, or can be acquired using Information Extraction techniques, can be extremely useful for improving classification accuracy. Finally, the relative performance of the different classifiers being tested gives us insights into the strengths and limitations of our algorithms for hypertext classification.
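The idea of combining a page's own words with words from linked pages and with meta data can be illustrated with a minimal sketch. This is not the authors' implementation: the function `build_tokens`, the class `TinyNaiveBayes`, the source prefixes (`own:`, `nbr:`, `meta:`), and the toy training data are all hypothetical and chosen only for illustration. The sketch assumes a multinomial Naive Bayes model with add-one smoothing, one of several reasonable choices.

```python
import math
from collections import Counter, defaultdict


def build_tokens(own_words, neighbor_words=(), meta=()):
    """Tag each token with its source (hypothetical prefixes) so that words
    from the page, from linked pages, and from meta data stay distinct."""
    return (["own:" + w for w in own_words]
            + ["nbr:" + w for w in neighbor_words]
            + ["meta:" + m for m in meta])


class TinyNaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.token_counts = defaultdict(Counter)
        for tokens, label in zip(docs, labels):
            self.token_counts[label].update(tokens)
        self.vocab = {t for c in self.token_counts.values() for t in c}
        return self

    def predict(self, tokens):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.token_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for t in tokens:
                if t in self.vocab:  # ignore tokens never seen in training
                    score += math.log((counts[t] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy data, invented for illustration: both home pages share the same words,
# so only the linked-page words and the meta data separate the classes.
train_docs = [
    build_tokens(["welcome", "home"], ["flights", "hotels"], ["industry=travel"]),
    build_tokens(["welcome", "home"], ["loans", "accounts"], ["industry=finance"]),
]
train_labels = ["travel", "finance"]

model = TinyNaiveBayes().fit(train_docs, train_labels)
query = build_tokens(["welcome"], ["hotels"], ["industry=travel"])
prediction = model.predict(query)
print(prediction)
```

In this toy setup the page text alone cannot distinguish the two classes, which mirrors the abstract's point that words in web pages alone can be insufficient. Prefixing tokens by source is one simple way to keep noisy neighborhood words from being conflated with the page's own vocabulary.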

Similar resources

Hypertext Classification Diploma thesis Hervé

Web Directories have been historically collected and updated by hand but this method is unsatisfactory for three reasons: A team of Web Surfers maintaining such a database should face the gigantism of the World network. Its size would thus be incompatible with the economic constraints of startups. Even the biggest team would not be able to trace all the changes on the Web and to keep the databa...

A Study of Approaches to

Hypertext poses new research challenges for text classification. Hyperlinks, HTML tags, category labels distributed over linked documents, and meta data extracted from related web sites all provide rich information for classifying hypertext documents. How to appropriately represent that information and automatically learn statistical patterns for solving hypertext classification problems is an op...

Hypertext Classification Using Tensor Space Model and Rough Set Based Ensemble Classifier

As the WWW grows at an increasing speed, classifiers targeted at hypertext are in high demand. While document categorization is quite mature, the issue of utilizing hypertext structure and hyperlinks has been relatively unexplored. In this paper, we introduce a tensor space model for representing hypertext documents. We exploit the local-structure and neighborhood recommendation encapsulate...

Impact on Performance of Hypertext Classification of Selective Rich HTML Capture

Hypertext categorization is the automatic classification of web documents into predefined classes. It poses new challenges for automatic categorization because of the rich information in a hypertext document. Hyperlinks, HTML tags, and metadata all provide rich information for hypertext categorization that is not available in traditional text classification. This paper looks at (i) what represe...

Gender Patterns in Hypertext Reading

The effect of gender in learning has often been the focus of research because of its potential implications for academic achievement. However, the effect of gender in hypertext reading has not been thoroughly investigated. The Web in general and hypertext in particular have modified the way people access and use information. This paper reports the findings of an empirical study into gender di...

Publication date: 2001